feat: add framework for File Format Options #3794
base: main
Signed-off-by: Corwin Joy <[email protected]>
ACTION NEEDED: delta-rs follows the Conventional Commits specification for release automation. The PR title and description are used as the merge commit message. Please update your PR title and description to match the specification.
Note that fully supporting Parquet encryption requires being able to get write and read properties per-file, which is why the existing ability to set
I have marked this pull request as draft. It does not compile as is; I can come back to it once it compiles and passes unit tests.
@rtyler OK. It seems that when I auto-merged the main branch it introduced a build error. I have resolved this and the code is once again building and passing unit tests.
ion-elgreco
left a comment
I can see the benefit, but we really need to reduce the surface area of the changes being introduced.
@corwinjoy - awesome to see this come to fruition! Will find some time to give this a review hopefully tomorrow. At first glance one quick question. Do we see a way to "bundle" the datafusion specific stuff a bit more? It's a bit hard to keep track of all the individual flags while reviewing :)
What we did to minimize this dependency is define an abstract There might be some ways to refine this further, but in general we've tried to isolate and abstract these file properties where possible and not require datafusion.
@roeap From a user point of view, we've tried hard to make the settings as easy as possible. This can be seen in Calling
# Conflicts:
#   crates/core/src/delta_datafusion/table_provider.rs
#   crates/core/src/operations/delete.rs
#   crates/core/src/operations/drop_constraints.rs
#   crates/core/src/operations/filesystem_check.rs
#   crates/core/src/operations/load.rs
#   crates/core/src/operations/merge/mod.rs
#   crates/core/src/operations/mod.rs
#   crates/core/src/operations/optimize.rs
#   crates/core/src/operations/restore.rs
#   crates/core/src/operations/update.rs
#   crates/core/src/operations/write/mod.rs
#   crates/core/tests/command_optimize.rs
#   crates/core/tests/integration_datafusion.rs
# Conflicts:
#   crates/core/src/operations/optimize.rs
…s_scan functions. Signed-off-by: Corwin Joy <[email protected]>
…Builder Signed-off-by: Corwin Joy <[email protected]>
# Conflicts:
#   Cargo.toml
#   crates/core/src/operations/delete.rs
#   crates/core/src/operations/optimize.rs
#   crates/core/src/operations/update.rs
#   crates/core/src/operations/write/execution.rs
#   crates/core/src/operations/write/writer.rs
#   crates/core/src/table/builder.rs
…ies. Signed-off-by: Corwin Joy <[email protected]>
@ion-elgreco OK. I have applied the changes you suggested from two weeks ago and re-merged the latest from main. So, I am ready for another review when you get the chance. Thanks again for the suggestions!
#[cfg(feature = "datafusion")]
#[derive(Clone, Debug, Default)]
pub struct SimpleFileFormatOptions {
What's the purpose for this for other delta-rs users?
SimpleFileFormatOptions is a helper for users to easily create a FileFormatRef from TableOptions in order to set their read and write properties. We use it like:
let file_format_options = Arc::new(SimpleFileFormatOptions::new(tbl_options)) as FileFormatRef;
This is needed if users want to set their own TableOptions for reading and writing. In our examples, we only use this for encryption. But the idea is that users may set their own parquet options at the DeltaTable level. This default class is the easiest way to do it, holding TableOptions for reading and writing.
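To make the shape of this concrete, here is a minimal, self-contained sketch of the pattern described in this reply. It is not the code from this PR: the TableOptions struct is a stand-in for DataFusion's type, and the trait method table_options, the Table struct, with_file_format_options, effective_options, and the "example.compression" key are all hypothetical names chosen for illustration. Only the names SimpleFileFormatOptions, FileFormatOptions, and FileFormatRef are taken from the discussion.

```rust
use std::collections::HashMap;
use std::fmt::Debug;
use std::sync::Arc;

/// Stand-in for DataFusion's `TableOptions` (assumption: plain key/value settings).
#[derive(Clone, Debug, Default, PartialEq)]
pub struct TableOptions {
    pub settings: HashMap<String, String>,
}

/// Hypothetical trait shape: one source of file-level read and write settings.
pub trait FileFormatOptions: Debug + Send + Sync {
    fn table_options(&self) -> TableOptions;
}

/// Shared handle, mirroring the `FileFormatRef` name used in the reply above.
pub type FileFormatRef = Arc<dyn FileFormatOptions>;

/// Simplest possible implementation: it just holds a `TableOptions` value.
#[derive(Clone, Debug, Default)]
pub struct SimpleFileFormatOptions {
    table_options: TableOptions,
}

impl SimpleFileFormatOptions {
    pub fn new(table_options: TableOptions) -> Self {
        Self { table_options }
    }
}

impl FileFormatOptions for SimpleFileFormatOptions {
    fn table_options(&self) -> TableOptions {
        self.table_options.clone()
    }
}

/// Hypothetical table handle that carries the options for every operation.
#[derive(Debug, Default)]
pub struct Table {
    file_format_options: Option<FileFormatRef>,
}

impl Table {
    /// Set the options once, at the table level.
    pub fn with_file_format_options(mut self, opts: FileFormatRef) -> Self {
        self.file_format_options = Some(opts);
        self
    }

    /// What a read or write operation would consult: the table-level
    /// options if present, otherwise the defaults.
    pub fn effective_options(&self) -> TableOptions {
        self.file_format_options
            .as_ref()
            .map(|o| o.table_options())
            .unwrap_or_default()
    }
}

/// Builds a table the same way as the one-liner quoted in the reply.
pub fn example_table() -> Table {
    let mut tbl_options = TableOptions::default();
    tbl_options
        .settings
        .insert("example.compression".to_string(), "zstd".to_string());
    let file_format_options =
        Arc::new(SimpleFileFormatOptions::new(tbl_options)) as FileFormatRef;
    Table::default().with_file_format_options(file_format_options)
}
```

The point of the trait object is that a caller with simple needs can wrap a fixed set of options, while something like an encryption-aware implementation could presumably compute properties per file behind the same handle.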
…ncy error. Signed-off-by: Corwin Joy <[email protected]>
Move KMS class from file_format_options.rs to new file kms_encryption.rs Signed-off-by: Corwin Joy <[email protected]>
4b9b5d8 to e44dbf8
@ion-elgreco OK. I have added changes to address your latest requests.
I had thought about doing this with an encryption feature gate, but left it out for this PR for a few reasons:
Description
This PR adds encryption support and other advanced file options to delta-rs by implementing a comprehensive framework for file format settings. The changes enable users to configure encryption settings, customize writer properties, and apply file-level formatting options when reading and writing Delta tables.
- FileFormatOptions trait and related infrastructure to handle file-specific configurations
In general, we have added a new trait called FileFormatOptions at the root DeltaTable level to unify how files within a delta table are read and written with specific formatting. The idea is that you can apply these settings once, at the top level, and then seamlessly perform any operations with the necessary settings.
This PR leverages the DataFusion TableOptions structure to support format options for multiple underlying file formats. (The idea being that delta-rs may eventually want to support storage formats beyond Parquet, such as Vortex or Lance.) Additionally, it centralizes file format options in a single, consistent location. This avoids the current difficulty where one has to set WriterProperties separately and then set reader properties as part of the SessionState. (This is in line with comments from @roeap about how file configuration might be improved: #3300 (comment).) We would also like to eventually extend this upgrade to add notations about these file configurations to the delta table properties. For example, if the files are encrypted, one could add a KMS configuration for where to retrieve encryption keys.
Review Suggestion
This PR turned out to be larger than we hoped, so apologies for that, but I don't know how to split it into smaller pieces.
When reviewing, we suggest starting with the file crates/core/src/table/file_format_options.rs to get an overview of the new file format trait that can be applied to delta tables.
Related Issue(s)
Support Parquet Modular Encryption:
#3300
Documentation
Parquet Modular Encryption: https://docs.google.com/document/d/1MUg1J7u5VdLkgejJ4ybzfZt1OmwhQkq2iGPxsn4gqLI/edit?tab=t.0#heading=h.34wvmhc1zdch
Attribution
This PR was created in collaboration with @adamreeve